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Abstract: To identify novel dynamic patterns of gene expression, we de- 
velop a statistical method to cluster noisy measurements of gene expression 
collected from multiple replicates at multiple time points, with an unknown 
number of clusters. We propose a random-effects mixture model coupled 
with a Dirichlet-process prior for clustering. The mixture model formulation 
allows for probabilistic cluster assignments. The random-effects formulation 
allows for attributing the total variability in the data to the sources that are 
consistent with the experimental design, particularly when the noise level is 
high and the temporal dependence is not strong. The Dirichlet-process prior 
induces a prior distribution on partitions and helps to estimate the number 
of clusters (or mixture components) from the data. We further tackle two 
challenges associated with Dirichlet-process prior-based methods. One is 
efficient sampling. We develop a novel Metropolis-Hastings Markov Chain 
Monte Carlo (MCMC) procedure to sample the partitions. The other is effi- 
cient use of the MCMC samples in forming clusters. We propose a two-step 
procedure for posterior inference, which involves resampling and relabeling, 
to estimate the posterior allocation probability matrix. This matrix can be 
directly used in cluster assignments, while describing the uncertainty in 
clustering. We demonstrate the effectiveness of our model and sampling 
procedure through simulated data. Applying our method to a real data set 
collected from Drosophila adult muscle cells after five-minute Notch acti- 
vation, we identify 14 clusters of different transcriptional responses among 
163 differentially expressed genes, which provides several novel insights into 
underlying transcriptional mechanisms in the Notch signaling pathway. The 
algorithm developed here is implemented in the R package DIRECT. 

Keywords and phrases: Bayesian clustering, mixture model, random 
effects, Dirichlet process, Chinese restaurant process, Markov-chain Monte 
Carlo (MCMC), label switching, multivariate analysis, time series, microar- 
ray gene expression. 



1. INTRODUCTION 

We are interested in the dynamics of the transcriptional response to activation of 
the Notch signahng pathway (Housden et al., 2012). During transcription, RNA 
molecules are produced using the DNA sequence of the genes as templates, lead- 
ing to the notion of these genes being "expressed". Some of the RNA molecules, 
mRNA specifically, are subsequently translated into proteins, which directly reg- 
ulate all kinds of biological processes. Notch and several other proteins form the 
Notch signaling pathway, which transmits signals between cells through regu- 
lating transcription in neighboring cells. Critical to the normal development of 
many organisms, Notch and its signaling pathway are under active and exten- 
sive investigation (see Bray 2006 for a review). Using Drosophila as the model 
organism, we aim to characterize patterns of the transcriptional responses of 
the whole genome following a pulse of Notch activation (Housden et al., 2012). 

* Supported in part by BBSRC grant BBF00897X to SJB, SR and ST. 
tAlso at the University of Southern California. Supported in part by NIH grant P50 
HG002790. 



A. Q. Fu et al./Bayesian Clustering 



3 



Although some Notch target genes have been identified in Drosophila (Jennings 
et al., 1994; Krejci et al., 2009), it is unclear whether other genes are also tar- 
gets of Notch, whether target genes can be activated (increased expression) or 
repressed (decreased expression) only by a pulse of activation, and what expres- 
sion patterns these genes exhibit following Notch activation. To generate the 
data we analyze here. Notch signaling was initiated in Drosophila adult mus- 
cle cells and stimulated for a short pulse of 5 minutes, and mRNA levels were 
measured in these treated cells relative to untreated cells using microarrays for 
4 biological replicates at 18 unevenly-spaced time points during the 150 min- 
utes after activation (Housden 2011; Housden et al. 2012; also see Section 5 for 
details on the experiment and pre-processing of the data). We aim to address 
the following questions for the 163 differentially expressed genes: (i) how many 
different expression patterns are there and what are these patterns? and (ii) 
which genes exhibit what expression pattern? These questions naturally call for 
a clustering approach to analyzing these data. 

However, there are several challenges associated with this data set. Firstly, 
these data are different from the conventional time series data. Time series 
often refer to the measurements of a single subject over time. In the microarray 
experiment, however, a biological replicate refers to a population of cells, and 
the expression levels at any time point are measured for a distinct sample of cells 
from the same population. Although the cells from the same biological replicate 
are typically assumed to be homogeneous, the heterogeneity among cells is non- 
negligible and contributes to the noise in the data (Spudich and Koshland, 
1976; McAdams and Arkin, 1997; Elowitz et al, 2002). Secondly, since only 
a pulse of Notch activation was applied, the level of expression in our data is 
often not much different from (Fig. 1) and the temporal dependence is weak. 
Specifically, the mean expression level across time points and across replicates 
is only 0.1 with a standard deviation (SD) of 0.5, leading to a signal-to-noise 
ratio of only ~0.2. Meanwhile, the median of the lag 1 autocorrelation across 
replicates is only 0.4 (interquartile range or IQR: 0.2-0.6). Thirdly, existing 
clustering software programs such as MCLUST (Fraley and Raftery, 2002, 2006) 
and SplineCluster (Heard, Holmes and Stephens, 2006) give vastly different 
results (see Section 5 for detail). 

These scientific questions and the challenges in the data thus motivated the 
clustering method we develop here. Our clustering method consists mainly of 
a random-effects mixture model and a Dirichlet-process prior. We propose the 
random-effects model to tackle the high level of noise in the data that arise 
from several sources. Under the random-effects model, we make use of the full 
data, rather than reducing the data to the means across replicates, which may 
not be accurate with this level of noise. Under this model, we also do not make 
many assumptions about the underlying process, which is still largely unknown. 
Novel patterns detected this way are unlikely to be the result of potentially 
inappropriate assumptions. The use of a Dirichlet-process prior enables us to 
estimate the number of clusters directly from the data. Below we review existing 
relevant work, which laid the foundation for our method. 

Most clustering methods that are shown to be effective on time-course data 
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Fig 1. Mean profiles of 163 significantly expressed genes (false discovery rate 10% by EDGE; 
Storey et al. 2005) over the time course of 18 time points. 



are model-based, with the distribution following a mixture of multivariate Gaus- 
sian components (Fraley and Raftery, 2002; Medvedovic and Sivaganesan, 2002; 
Mcdvcdovic, Young and Burngarner, 2004; Cclcux, Martin and Lavergne, 2005; 
Beal and Krishnamurthy, 2006; Fraley and Raftery, 2006; Heard, Holmes and 
Stephens, 2006; Ma et al., 2006; Qin, 2006; Zhou and Wakefield, 2006; Lau 
and Green, 2007; Booth, CascUa and Robert, 2008; Rasmussen et al., 2009; 
McNicholas and Murphy, 2010; Green, 2010). Different methods take different 
approaches to modeling the mean vectors and covariance structures. Several 
methods attempt to account specifically for the temporal dependence by mod- 
eling the (prior) mean vector in terms of spline functions (Heard, Holmes and 
Stephens, 2006; Ma et al., 2006) or as a random walk (Zhou and Wakefield, 
2006). As for the covariance structure, some methods (Medvedovic and Siva- 
ganesan, 2002; Medvedovic, Yeung and Burngarner, 2004; Heard, Holmes and 
Stephens, 2006; Qin, 2006; Lau and Green, 2007; Green, 2010) assume inde- 
pendence across items and across time points a priori. Both Fraley and Raftery 



A. Q. Fu et al./Bayesian Clustering 



5 



(2006) and McNicholas and Murphy (2010) take a matrix decomposition ap- 
proach and consider various models for the covariance matrix by constraining 
no or some decomposed terms to be identical across clusters. However, whereas 
Fraley and Raftcry (2006) apply eigenvalue decomposition, which is applicable 
also to data types other than time-course data, McNicholas and Murphy (2010) 
use a modified Cholesky decomposition, which has connections with autore- 
gressive models and is thus specifically designed for time-course data. Another 
common approach to modeling the covariance structure is random-effects mod- 
els, which account for variability arising from different sources (Celcux, Martin 
and Lavergne, 2005; Ma et al., 2006; Booth, Casella and Hobert, 2008). This is 
the approach we take in our clustering method. Indeed, with a random-effects 
mixture model, we demonstrate that specific modeling of the temporal structure 
may not be essential for clustering replicated time-course data. 

Estimating the number of clusters, or mixture components, under a model- 
based framework, has been a difficult problem. Several approaches exist, largely 
falling into two categories: optimization for a single "best" partition and a fully 
Bayesian approach that weights the partitions by their probabilities given the 
data. In the optimization category, the penalized likelihood approach, using 
criteria such as the Akaike Information Criterion (AIC), Bayesian Information 
Criterion (BIC) and so on, has been used by Fraley and Raftcry (2002), Celcux, 
Martin and Lavergne (2005), Ma et al. (2006) and McNicholas and Murphy 
(2010). Heard, Holmes and Stephens (2006) in their program SplineCluster 
and Booth, Casella and Hobert (2008) maximize the posterior probability of 
partitions given the data. Arguing that the maximal posterior probability of 
partitions may be difficult to compute reliably and may not be representa- 
tive, Lau and Green (2007) suggest maximizing posterior loss, an approach 
followed by in Green (2010). However, the main drawback with the optimiza- 
tion approach is that competing partitions with similar (penalized) likelihoods 
are simply ignored. Methods based on optimization may also suffer from nu- 
meric instability, as our experience with MCLUST (Fraley and Raftery, 2002) 
suggests (explained in Section 5). When clustering is used as an exploratory 
data analysis tool to understand the heterogeneity in the data, it is often de- 
sirable and realistic to explore more than one partition and to understand how 
and why the data support multiple competing partitions. We therefore find 
the fully Bayesian approach more appealing with this rationale. In this cate- 
gory, Zhou and Wakefield (2006) implemented the Birth-Death Markov Chain 
Monte Carlo (BDMCMC) scheme initially developed by Stephens (Stephens, 
2000a), which designs a birth-death process to generate new components and 
eliminate existing ones. Medvedovic and Sivaganesan (2002), Medvedovic, Ye- 
ung and Burngarner (2004), Beal and Krishnamurthy (2006), Qin (2006), Booth, 
Casella and Hobert (2008) and Rasmusscn et al. (2009) developed Markov Chain 
Monte Carlo (MCMC) schemes under a Dirichlet-process prior. The Dirichlet- 
process prior, a popular tool in nonparametric Bayesian statistics, can induce 
sparse partitions among items (Ferguson, 1973; Antoniak, 1974) and has been 
widely used in analyses such as nonparametric density estimation (Escobar and 
West, 1995; Fox, 2009). Here, we take the fully Bayesian approach and use a 
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Dirichlet-process prior to induce a prior distribution on partitions, which helps 
us to explore different numbers of clusters and to sample from partitions under 
each number. The clustering result from the Bayesian approach is essentially an 
average of all possible solutions weighted by their posterior probabilities. 

However, two major challenges remain in clustering under Dirichlet-process 
priors. One is efficient sampling. Many MCMC methods have been developed 
under Dirichlet-process priors for conjugate priors of the parameters (such as 
those reviewed in Neal 2000), restricting the choices of priors. Alternative sam- 
pling methods have been developed, such as Gibbs samplers designed for non- 
conjugate priors (MacEachern and MuUer, 1998), several Metropolis-Hastings 
(MH) samplers under the Chinese-restaurant representation (Neal, 2000), split- 
merge sampling (Jain and Neal, 2004, 2007), another two-stage MH procedure 
under an implicit Dirichlet-process prior (Booth, Casella and Hobert, 2008), 
retrospective sampling (Papaspiliopoulos and Roberts, 2008) and splice sam- 
pling (Walker, 2007; Kalh, Griffin and Walker, 2011), both of which are de- 
veloped under the stick-breaking process representation. Several of these and 
related methods are reviewed recently in Griffin and Holmes (2010). Here, we 
develop a novel MH sampler under the Chinese-ret aurant representation. Our 
MH sampler does not introduce additional (auxiliary or latent) variables or 
tuning parameters. It also does not require separate split and merge steps, but 
rather allows for dimension changes in a single step. In addition, it is based on 
standard MH calculations and is therefore straightforward to understand and 
easy to implement. 

The other major challenge is posterior inference. Existing approaches (Medve- 
dovic and Sivaganesan, 2002; Medvedovic, Yeung and Burngarncr, 2004; Beal 
and Krishnamurthy, 2006; Rasmussen et al., 2009; Dhavala et al., 2011) at- 
tempt to make use of the posterior "similarity" matrix, whose entries are the 
posterior probability of allocating two items to the same cluster, by applying 
linkage-based clustering algorithms to this matrix. Focusing on this matrix in 
effect converts the original clustering problem into another one, while discarding 
other valuable information in the MCMC samples. We propose a two-step pos- 
terior inference procedure that involves resampling and relabeling to estimate 
the posterior allocation probability matrix, which may be used more directly in 
forming clusters and other inference. 

In this paper, we present our method DIRECT, the DIrichlet process-based 
Random-Effects model as a Clustering Tool. We describe the random-effects 
mixture model in Section 2 and the Bayesian inference in Section 3, which in- 
cludes a novel MH MCMC algorithm for sampling partitions under the Dirichlet- 
process prior, as well as the two-step posterior inference procedure. We examine 
the performance of our method through simulation studies in Section 4. We ap- 
ply our method to the time-course microarray gene expression from the Notch 
experiment in Section 5. Compared with SplineCluster (Heard, Holmes and 
Stephens, 2006) and MCLUST (Fraley and Raftery, 2002, 2006), our method 
appears to be more accurate and sensitive to subtle differences in different clus- 
ters, in both simulation studies and the real application. In addition, the analysis 
of the real data reveals several novel insights into the transcriptional dynamics 
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after the pulse of Notch activation. We summarize and discuss tlie features of 
our method in Section 6. 

2. RANDOM-EFFECTS MIXTURE MODEL 

Consider N genes measured at J time points in each of the R rephcates. 
Let Mijr, i = 1, N, j — 1,...,J, r = l,...,R, be the measurement 
for the i-th gene at the j-th time point from the r-th rephcate. The J time 
points may be unevenly distributed. We assume that there are no missing 
data. We use a random-effects mixture model to describe the heterogeneity 
in replicated time-course data. Following the standard mixture model formula- 
tion with a known number of mixture components, K, we assume that Mi — 
{Mill, ■ ■ ■ , Miiji, . . . , Miji, . . . , Mijiij^ are independent and identically distributed 
realizations drawn from a mixture distribution with K components and a set of 
mixing proportions W}., k = 1, . . . , K . The superscript T represents "transpose" 
and ensures that Mi is a column vector. The density of Mi can be written as 
a weighted average: 

K 

f{M,\&, S) = 5] Wkfk{M,\e\ S^^), 
fc=l 

where = (0^, . . . , 0^) and S = (S"'^, . . . , S^) are parameters of the mixture 
distribution, with component-wise mean vector 0*^ and covariance matrix S*^, 
k = 1, . . . , K . Whereas it is possible to define a cluster by more than one mix- 
ture component, for presentation purposes we consider here the case where one 
mixture component defines a cluster and use "mixture component" and "clus- 
ter" interchangeably. Let Zi denote the cluster membership for the z-th gene. 
Then, 

Pr(Z,; = k\w,&,-E)^Wk, 

where w is the set of mixing proportions. Following the notation in Stephens 
(2000a) and denoting the data by M = {Mi, . . . , M^), we define the posterior 
allocation probabilities as Pr{Zi = k\M), i = 1, . . . , N and k = 1, . . . ,K, which 
form the posterior allocation probability matrix P of dimension N x K. We 
aim to estimate P as part of the inference and to form clusters based on the 
estimated P, using, for instance, the most likely allocation. 

Inspired by variance components approaches (Searle, Casella and McCulloch, 
2006) and random-effects models frequently used in longitudinal studies (Dun- 
son, 2010), we constrain the covariance matrix of each mixture component by 
attributing the total variability to three sources: clustering, sampling across 
multiple time points (or more broadly speaking, multiple experimental condi- 
tions), and sampling a limited number of replicates. Whereas the first source of 
variability is due to "grouping" of the genes, the latter two are defined by the 
design of the time-course experiment. Specifically, if the i-th gene is sampled 
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from the fc-th mixture component (i.e., Zi — k), the random-effects model can 
be written as follows: 

M,,r = e'] + cl>'i + 4+e^^,, (2.1) 

where 





E(M„-,) 








^iidN(0,A^) 


4\{Z^ 




--iidN(0,A^) 




=fe,A^} 


--iidN(0,A^) 



In this formulation, Q'j represents the "true" value (fixed effect) at the jth time 
point, 0*^ the within-cluster random effect, t^j the cross-experimental-condition 
random effect, and e^j-^ the replicate random effect. Here, the experimental con- 
ditions are time points. We assume that random effects t^j and e^j^ are 
independent across clusters and of each other. Each of the three random ef- 
fects has a corresponding variability term: A^ is the within-cluster variability, 
A^ the cross-experimental-condition variability, and A^ the residual variability. 
The three types of variability are all component specific. 

Given cluster membership Zi — k, replicated measurements of the i-th gene. 
Mi, follow a multivariate normal distribution: 

M,|{Z, = fc, e^ A^, A^, A^-} ^i„d Nj«(0^gg, S^gg), 

which has aggregated mean vector ©agg — i®'' ■. ■ ■ ■ i ®^ )'^7 where 0*^ repeats 
R times, and aggregated covariance matrix S^gg whose entry is 

Cov(M,,„ M,,v') = + Ai:l(j = j') + A^l(j = /,r = /). (2.2) 

In addition. Mi and M where i ^ j, are independent of each other. 

Parameters of interest under this random-effects mixture model include the 
number of mixture components, K, component-specific parameters 0*^, A^, Aj, 
and A^, where fc — 1, . . . , and posterior allocation probability matrix P of 
dimension N x K. 

3. BAYESIAN INFERENCE 
3.1. The Dirichlet-process prior 

As mentioned in Section 1, Dirichlet processes help to partition the parameter 
space without prior knowledge of the number of partitions, K, and thus provide 
a coherent framework for directly estimating K from data and for sampling in 
a parameter space of variable dimensions. Denote the parameter of interest for 
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each gene by 7^, which, in our case, may include a mean vector 0^ and three 
terms of variabihty, namely A^^, A^-i and A^j, such that 

M,~i^(7,), i = l,...,iV, 

where F represents a distribution, which is a multivariate normal distribution 
in our case. We assume that 7jS follow a random distribution G, which is in 
turn a random draw from a (compound) Dirichlet process, denoted as follows: 

j,^G (3.1) 
G-DP(a,G'o), a>0, (3.2) 

with base distribution Gq (continuous in our case) , which describes how values in 
the space are generated, and concentration parameter a, which is non-negative. 
Note that 7jS are identically distributed, but not necessarily independent. The 
dependence among them under the Dirichlet process specifically refers to their 
values being clustered; that is, some 7jS may take on identical values as some 
other 7jS. 

Indeed, the Dirichlet process describes a mechanism by which clustered pa- 
rameters 7j may be simulated. We can generate a realization for one of them, 
say, 'J I, from Gq. The value of 72 may be identical to 7]^, with probability 
1/(1 -|- a), or an independent realization also from Go and different from 7^^, 
with probability a/{l + a). Generally, having generated n realizations, the value 
of the n + 1st realization follows the following distribution (Antoniak, 1974): 

Pi-(7„+i=7l7i,---,7„,«) = <^ «+" ' ')'ei7i,---,7„l (3.3) 

7^ {7i,---,7„} 

where the indicator function 1(7, = 7) takes value 1 if 7j — 7 and otherwise. 
In other words, the probability of 7„+i being identical to one of the existing 
values is proportional to the number of times this value has already shown up. 
This sampling process is also known as the Chinese restaurant process (reviewed 
in Pitman 2006), a useful representation for Neal (2000) to derive the Metropolis- 
Hastings sampling procedures, which are explained in the next section. 

The sampling distribution above induces a distribution on the partition of 
the N values, 7^^, . . . , 7^, with a random number of partitions, K. Specifically, 
the partition distribution is with respect to the cluster memberships Zi, i = 
1, . . . , iV, as weU as K (Antoniak, 1974): 

Pr(Zi, . . . , ZN,K\a > 0) = ^ .n ""^ n(^' " 1)'' (^.4) 

r(a + N) fj^ 

where Ni is the size of the l-th cluster, and 

PiiZi = ■■■ = Zn,K = l\a = 0) = I. (3.5) 
We use this distribution as the prior in our Bayesian inference. 
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As a measure of "concentration" , very small a leads to a small probability of 
taking on a new value in the Dirichlet process, as Eq. (3.3) suggests, and hence 
to the probability mass being concentrated on a few distinct values, as Eqs. (3.4) 
and (3.5) suggests. As a — )■ 0, 7jS are identical, which correspond to a single 
draw from the base distribution Gq. On the other hand, large a leads to a large 
probability of taking on new values in the Dirichlet process of Eq. (3.3) and 
an appreciable probability for having a range of distinct values in Eq. (3.4). As 
a — > oo, 7jS are all different and form an independent and identically distributed 
sample from Gq. Therefore, a effectively controls the sparsity of partitioning (or 
clustering). 

We note that Eqs. (3.3)-(3.5) characterize the canonical Dirichlet process with 
parameter a, denoted DP(a), for an arbitrary space, as Antoniak (1974) de- 
fined it. The representation in Expression (3.2), which we consider a compound 
Dirichlet process, includes the additional information on how the elements of 
the space arise: they are realizations of the base distribution Gq. 



3.2. A Metroplis- Hastings sampler for cluster memberships 

The key step in the MCMC algorithm is sampling partitions, specifically, cluster 
memberships Zi, under the Dirichlet-process prior. We develop a Metropolis- 
Hastings sampler that allows non-conjugate priors for parameters and efficient 
mixing. 

Similar to Ncal (2000), we design the MH procedure to sample each Z,; during 
an MCMC update. Let the current value of Zi be z' , which, together with all 
the other Zj, gives the current number of clusters as K = k' . We propose a 
new value z* for Zi, which gives the proposed value k* for K. Let ^ be the 
parameter vector of interest for the random-effects mixture model under the 
Dirichlet-process prior, such that 

€ = {K, 01, ... , &K,^4>^, ■ ■ ■ , ^4>^ , Ar^, . . . , Ar^, Ac^, . . . , Ae^, Zi, . . . , ZN,a}. 

We accept the proposal with probability min(l,iJ), where H is the Hastings 
ratio computed as follows: 

^ ^ njZ, ^ z*) g{Z, ^z'\Z,^z*) 

Tr{Zi = z') g{Zi = z*\Zi = z') 
^ Pr{M,\Z, = z*, ■) Pr(zi, . . . , z* , . . . , zn , k*\a) g{z'\z*) 

Pr(M,|Z, = z', •) Pr(zi, . . . , z', . . . , z^, k'\a) g(z*|z') 
^ Pr(M,|z*,-)Pr(z*,fc*|z_„a)ff(z^|z*) 

Pr(M,|z',-) Pr(z',fc'|z_„a) 5(z*|z')' 

where • refers to current estimates of parameters in ^ other than Zi, and z_i 
denotes the cluster memberships of all genes except for the i-th one, which do 
not change when we update Zi. 

Under the Dirichlet-process prior, we can compute the conditional probability 
Pi{z' ,k'\z_i,a) as in Proposition 1: 
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Proposition 1. Consider N values drawn from a Dirichlet process with con- 
centration parameter a > 0. These values can be partitioned into K clusters, 
where K is a random variable, with Zi, i = 1,...,N, indicating the cluster 
membership. Then the following conditional probability holds: 



Pi{Z, = z,K = k\Z^, = z^„a) = 



. N-l+a ' 



Zi is not in a singleton cluster 
Zi is in a singleton cluster ' 

(3.7) 



where Z^i, with value z_i denotes the cluster memberships, excluding the i-th 
gene, and is the size of the z-th cluster. 

Proof. See Appendix A. □ 

Alternatively, Ncal (2000) derived Eq. (3.7) first under the finite mixture 
model with a fixed number of components K, and then let K go to infinity. 

Neal (2000) then proposed an MH procedure, using the conditional probabil- 
ity in Eq. (3.7) as the proposal distribution g, which led to a simplified Hastings 
ratio: 

Pr(M,|z*,.) 

The main problem with this MH sampler is slow mixing: because the probability 
of a move is proportional to the size of the cluster, the Markov chain can be 
easily stuck, especially when there exist one or a few large clusters. For example, 
consider N = 200 and current clusters 1-3 of size 185, 10 and 5, respectively. 
A gene currently allocated to cluster 1 may be much more similar to cluster 
3, implying a high likelihood ratio as in the simplified Hastings ratio (3.8). 
However, the probability of proposing such a favorable move from cluster 1 to 
cluster 3 is only 5/(199 -I- a), where a is usually small to induce a parsimonious 
partition. The probability of moving a gene to a previously nonexistent cluster 
is a/(199 -I- a), which can be even smaller. 

We develop a novel MH MCMC strategy to deal with poor mixing of Neal's 
MH sampler. Our proposal distribution for a cluster membership is discrete 
uniform on the integer set from 1 to k' + 1, excluding the current cluster the 
gene belongs to, where k' is the number of existing clusters. This proposal 
distribution forces the proposed cluster membership always to be different from 
the current one, and makes the Markov chain move to a new or small cluster 
more easily. Whether to accept the proposal or not depends on the Hastings 
ratio, which needs to be recalculated as in Proposition 2. 

Proposition 2. For cluster membership Zi with current value z' , if proposal z* 
is generated from a discrete uniform distribution over the integer set {1, . . . , z' — 
1, z'-t-l, . . . , fc'-f 1}, where k' is the current number of clusters, then the Hastings 
ratio takes on values as listed in Table 1, where four cases, including generation 
of a new cluster and elimination of an existing cluster, are considered. 
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Table 1 

Hastings ratio for four cases under the proposed Metropolis-Hastings sampler for cluster 
membership Zi with current value z' and proposed value z* . k* and k' are the number of 
clusters after and before the proposed move, respectively. 





Current Cluster 
A Singleton 


Proposal 
An Existing Label 


k* - k' 


Hastings Ratio 


1 

2 
3 

4 


Yes 
Yes 
No 

No 


Yes 
No 
Yes 

No 


-1 



1 


Vr{Mi\z',-) ]V_* fc' 
Pr(Mi|z',-) a fc'-l 
Pr(Mi|z*,0 
Pr(Mi\z' ,■) 
Pr(Mi|z*,-) N^t 
Pr(Mi|z',-) JV^/-1 
Pr{JVfi|z*,-) a k' 
Pr{Afi|z',-) JV,,-1 fc' + l 



Proof. The proof of this proposition can be found in Section 1 of the Supple- 
mental Material. □ 



3.3. Other prior distributions 

The base distribution Gq specifies the prior on the cluster mean vector 0^, each 
of the three types of variability A^'^, Xr'^ and Ae*", for all k. We use a uniform 
distribution on [0, u] as the prior for the As, and experiment with different values 
of the upper bound u. Values of u are guided by the data. 

We experiment with three options for 0^: (i) a zero vector of length J, where 
J is the number of time points. This is a natural choice for our data considering 
that the relative gene expression level on the log2 scale is not much different 
from 0; (ii) a realization generated from an Ornstein-Uhlenbeck process (Mer- 
ton, 1971). An OU process has four parameters: the starting value, the mean 
and variation of the process, and the mean-reverting rate. We therefore specify 
the normal distribution of the starting value, and the normal distribution of the 
process mean, the uniform distribution of the process variation, and the gamma 
distribution for the mean-reverting rate; and (iii) a realization generated from 
a Brownian motion with drift. This process has three parameters: the starting 
value, the mean and the variation (Taylor and Karlin, 1998). Similarly, we spec- 
ify the normal distribution of the starting value, and the normal distribution 
of the process mean, and the uniform distribution of the process variation. Val- 
ues of the parameters in these distributions are again guided by the summary 
statistics of the data. 

For the concentration parameter a, we experiment with two options: (i) a 
Gamma prior with the shape and rate parameters, which can be updated by 
a Gibbs sampler, as described in Escobar and West (1995); and (ii) a uniform 
prior on [0,u'], where u' can be different values, which is updated by an MH 
sampler (see Section 2 of the Supplemental Material) . 
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3.4. The MCMC algorithm for i 

The complete MCMC algorithm for sampling £ consists of two major steps in 
each iteration: 

Step 1. For each i from 1 to N , update Zi using the MH sampler described 
above; 

Step 2. Given the partition from Step 1, update other parameters in ^ 
using Gibbs or MH samplers. Details of this step are in Section 2 of the 
Supplemental Material. 

If the total number of MCMC iterations is S, then the time complexity of this 
MCMC algorithm is roughly 0{SJR{AN + K)), where 4 comes from the steps 
required in the MH sampler described above, such as generating a proposal, 
computing the likelihoods and the Hastings ratio. 

3.5. Two-step posterior inference under the Dirichlet-process prior 

For probabilistic clustering, wc would like to estimate the posterior allocation 
probability matrix P of dimension N x K with entries pik — Pr{Zi — k\M), 
each of which is the probability of the i-th gene belonging to the fc-th clus- 
ter given the data. This matrix is not part of the parameter vector ^ and is 
therefore not sampled during MCMC. Below, we propose resampling followed 
by relabeling to estimate P from H MCMC samples of ^, while dealing with 
label-switching (Stephens, 2000b). 

1. Resampling; Let Q^''^ of dimension A'' x K'''^\ whose entries are g^-^'*, 
/i = 1, . . . , if, be the posterior allocation probability matrix from the h-th 
MCMC sample with arbitrary labeling. The resampling step builds upon 
an alternative representation of the Dirichlet process as an infinite mix- 
ture model (Ncal, 2000; Green, 2010). Specifically, for a Dirichlet process 
defined in Eqs. (3.1) and (3.2) with concentration parameter a and base 
distribution Gq, an infinite mixture model representation corresponds to 
taking the limit in the finite mixture model below, letting K ^ 00 and 
a* 0, such that a*K -> a (Green, 2010): 

Yk^Go, k^l,...,K 
{wi, . . . , wk) ^ Dirichletif (a* , . . . , a*) 
Pi-(7. =7fe) = Pi-(^. 

Conditional on the h-l\v MCMC sample, the mixture model for the data 
becomes finite: 

{w'^^\ . . .,wf,,,) ^ Dirichlet^(M(a(''\ • • • ,a(^)) 
Pr(zf^ -/fclujC')) =ii;f) 
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Then, the posterior probabihty q^f. can be sampled from the following 
distribution using the h-th MCMC sample 

cx7Vjfl(M, I )Dirichlet^<.) (4" V^"),--- 

where mixing proportion w^!^^ is generated from a (conditionally) finite 
Dirichlet distribution. The time complexity of this step is roughly 0{H{NJR+ 
K)). 

2. Relabeling: Labels in Q^''-*, h — \, . . . ^ H , oi dimension N x K'^^\ are ar- 
bitrary: for example, cluster #2 in Q*^''-' does not necessarily correspond to 
cluster #2 in Q*-*-*, where s ^ t.To deal with arbitrary labeling (also known 
as "label-switching"), we apply the relabeling algorithm from Stephens 
(2000b) (Algorithm 2 in that paper) to matrices Q to "match" the labels 
across MCMC samples. The dimension of Qs are set to be x i^max, 
where i^max is the maximum number of clusters from all recorded MCMC 
samples. We fill in matrices of lower dimensions with Os such that all Qs 
have the same dimension. Stephens' relabeling algorithm then finds a set of 
permutations, one for the columns of each Q, and the resulting matrix P, 
such that the KuUback-Leibler distance between P and column-permuted 
Qs is minimized. Details of our application, which also implements the 
Hungarian algorithm (a.k.a., Munkres assignment algorithm; Kuhn 1955; 
Munkres 1957) for minimization, can be found in Section 3 of the Supple- 
mental Material. If L is the number of iterations for the relabeling step 
to achieve convergence, then the time complexity of this step is roughly 
0{LH{NJR + K^)), as the time complexity of the Hungarian algorithm 
is 0{K^) (Munkres, 1957). 



4. SIMULATIONS 



We investigate the performance of our MH MCMC algorithm, and compare the 
clustering performance of our method with MCLUST (Fraley and Raftcry, 2006) 
and SplineCluster (Heard, Holmes and Stephens, 2006) on data sets simulated 
from multiple settings, each with a different number of clusters and different 
values of variabilities. The size of each data set is comparable to the number of 
differentially expressed genes we identify from the real time-course data, which 
we introduced in Section 1 and will describe in detail in Section 5: the number 
of items A'' is between 100 and 200, the number of experimental conditions (time 
points) J is 18, and the number of replicates R is 4. The last two values are 
identical to those of the real data. However, to keep track of the parameters for 
individual clusters, we consider only 6 clusters instead of the 14 or 19 clusters 
our method infers for the real data (Section 5). 

For each cluster, we simulated data from a multivariate normal distribution. 
Specifically, we generated the mean vector from an Ornstein-Uhlenbeck (OU) 
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Table 2 

Key parameter values used in four sets of simulations. Ten data sets were simulated under 
each setting. The true number of clusters is 6. True standard deviations of the three types of 
variability (within- cluster variability, cross-experimental-condition variability, and residual 

variability) are given. Size refers to the number of items simulated for each cluster. 
Standard deviations used in Simulation #1 are close to some of the clusters inferred for the 

real time-course data. 



Standard Deviation 





Witliin-Ciuster 


Expt. Cond. 


Resid. 




Simu. (Reps) 




x/a7 


Vk 


Size 


#1 (10) 


6 0.05 


0.01 


0.2 


80 




0.1 


0.05 


0.2 


20 




0.1 


0.05 


0.2 


10 




0.1 


0.05 


0.1 


10 




0.2 


0.1 


0.2 


70 




0.5 


0.1 


0.6 


10 


#2 (10) 


6 0.01 


0.5 


0.5 


20 




0.1 


0.5 


0.5 


20 




0.1 


0.5 


0.5 


20 




0.5 


0.5 


0.5 


20 




0.5 


0.5 


0.5 


20 




1 


0.5 


0.5 


20 


#3 (10) 


6 





0.26 


80 










0.35 


20 










0.35 


10 










0.25 


10 










0.50 


70 










1.20 


10 


#4 (10) 


6 





1.01 


20 










1.1 


20 










1.1 


20 










1.5 


20 










1.5 


20 










2.0 


20 



Table 3 

Comparison of methods on simulated data in terms of the corrected Rand Index (Hubert and 
Arabic, 1985) to assess clustering accuracy: the higher the corrected Rand Index, the closer 
the inferred clustering is to the truth. Each cell displays the mean (standard deviation in 
parentheses) of the Rand Index over the 10 data sets simulated under each setting 



Simu. 


True K 


DIRECT 


SplineCIuster 


MCLUST 


#1 


6 


0.99 (0.01) 


0.84 (0.02) 


0.60 (0.13) 


#2 


6 


0.69 (0.08) 


0.47 (0.10) 


0.71 (0.06) 


#3 


6 


0.99 (0.01) 


1.00 (0.00) 


1.00 (0.00) 


#4 


6 


0.95 (0.04) 


0.47 (0.00) 


0.97 (0.03) 



process with three parameters, which are the initial value, the overall mean, and 
the mean-reverting rate. We constructed the covariance matrix as specified in 
Eq. (2.2) with true values of the three types of variability (Table 2). In Simula- 
tions #1 and #2, all three types of variability are non-zero, with Simulation #2 
having more extreme within-cluster variability in some clusters. In particular, 
the level of different types of variability in Simulation #1 is largely comparable 
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Table 4 

Comparison of methods on simulated data in terms of the number of non-singleton (NS) 
clusters and the number of singleton (S) clusters inferred under each method. Each cell 
displays the mean (standard deviation in parentheses) number of clusters over the 10 data 
sets simulated under each setting. 



Simu. 


True K 


DIRECT 


SplineCluster 


MCLUST 


NS 


S 


NS 


S 


NS 


s 


#1 


6 


6.2 (0.4) 


1.7 (1.1) 


7.3 (0.5) 


0.0 (0.0) 


12.0 (2.2) 


0.0 (0.0) 


#2 


6 


7.5 (1.4) 


19.6 (7.2) 


3.8 (0.6) 


0.2 (0.4) 


7.7 (1.1) 


0.1 (0.3) 


#3 


6 


6.2 (0.6) 


0.6 (0.5) 


6.0 (0.0) 


0.0 (0.0) 


6.0 (0.0) 


0.0 (0.0) 


#4 


6 


6.1 (0.3) 


2.8 (2.2) 


3.0 (0.0) 


0.0 (0.0) 


6.0 (0.0) 


0.0 (0.0) 



to that of 6 of the 14 clusters our method infers for the real time-course data 
(Section 5). In Simulations #3 and #4, only the residual variability is non-zero, 
with Simulation ^4 having high variability in some clusters. The simplified co- 
variance structure in the latter two simulations has been adopted in SplineClus- 
ter and other methods (Medvedovic and Sivaganesan, 2002; Medvedovic, Yeung 
and Burngarner, 2004; Qin, 2006). Since SplineCluster and MCLUST allow only 
one replicate per item, we average over the replicates in simulated data and use 
these sample means as input for SplineCluster and MCLUST, and use default 
settings in both programs. Also note that neither DIRECT or MCLUST assumes 
temporal dependence, whereas SplineCluster does. 

Table 3 summarizes the performance of DIRECT and compares it to that 
of SplineCluster and MCLUST. Correctly inferring the number of clusters is 
key to the overall performance: when the inferred number of clusters is close to 
the truth, all three methods manage to allocate most of the items to the right 
clusters and thus achieve a high corrected Rand Index, and vice versa (Tables 3 
and 4). Below we discuss the performance of each method in turn. 

DIRECT recovers the true clustering consistently well in all the simulations, 
obtaining high accuracy of cluster assignments of individual items, which is 
reflected in the high corrected Rand Index (Table 3) . Accuracy and consistency 
come from recovering the true number of (non-singleton) clusters, as indicated 
in Table 4. This good performance persists even when the data were simulated 
under the "wrong" model (Simulations #3 and #4). However, DIRECT tends 
to produce singleton clusters, when those singletons are simulated from clusters 
of high variation (Table 4). 

MCLUST achieves the highest level of accuracy among the three methods in 
three out of the four simulations. However, its performance is much worse than 
DIRECT and SplineCluster in Simulation #1: MCLUST tends to infer a higher 
number of clusters with large variability (Table 4) . 

In contrast, SplineCluster tends to infer fewer clusters for more heterogeneous 
data. The dependence structure in Simulations ^3 and ^4 is in fact the same 
as that in SplineCluster. However, while SplineCluster infers the number of 
clusters correctly and allocates the items correctly in Simulation #3, it infers a 
much lower number of clusters in Simulation #4, which leads to a much lower 
corrected Rand Index (Tables 3 and 4). This is because the heterogeneity in 
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Simulation is too high for SphneCluster to distinguish different clusters, 
and SplineCluster therefore settles on a more parsimonious clustering than the 
truth. This further explains the low number of clusters SplineCluster infers for 
Simulation #2, which also has high heterogeneity within each cluster (Tables 2 
and 4). 

5. APPLICATION TO TIME-COURSE GENE EXPRESSION 
DATA 

5.1. Experimental design and data pre-processing 

As explained in Introduction, gene expression data were collected using two- 
color microarrays from four independent biological replicates of Drosophila adult 
muscle cells at 18 unevenly-spaced time points (in minutes): 0, 5, 10, 15, 20, 25, 
30, 35, 40, 50, 60, 70, 80, 90, 100, 110, 120, 150, where is the start of a 5- 
minute treatment of Notch activation (Housden, 2011). Similar to other gene 
expression data, the expression measured here is in fact the relative expression 
of treated cells to control cells, evaluated as the log2 fold change. The two colors 
of the microarray were used to distinguish treated and control cells. We applied 
quantile normalization to the distributions of spot intensities of the two colors 
across all 18 x 4 = 72 arrays. Mapping of the oligonucleotide probes on the 
microarray to the Drosophila genome followed FlyBase release 4 and earlier 
for Drosophila melanogaster. After the initial quality screen we retained 7,467 
expressed genes; that is, the absolute expression levels of genes in the treated 
and control cells are detectable by the microarray. These retained genes are 
about half of the Drosophila genome. We further imputed missing values in the 
temporal profiles of these genes (see Section 4 of the Supplemental Material). 
These data were challenging to analyze as the (relative) expression levels of 
most of these genes were close to 0. To identify differentially expressed (DE) 
genes over the time course, we applied EDGE (Storey et al., 2005) to identify 
163 such genes at a false discovery rate (FDR) of 10% and 270 genes at an FDR 
of 25%. However, even among the 163 DE genes, the (relative) expression levels 
are generally very low (Fig. 1). 

5.2. Results from DIRECT 

We ran DIRECT multiple times on both data sets with different initial values. 
Each run consisted of 10,800 iterations, including 20% burn-in. MCMC samples 
were recorded every 54th iteration. These runs each took about 8 hours for 163 
genes and 12 hours for 270 genes on 2.3 GHz CPUs, including approximately 
1 hour for resampling and a few minutes for relabeling. Since the results were 
consistent across runs, we report below the results from only one run for each 
data set, averaging the inferred posterior allocation probability matrix across 
MCMC iterations and defining clusters in terms of the most likely allocations a 
posteriori. 
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Fig 2. Mean profiles (gray and black lines) of individual genes in clusters inferred under our 
DIRECT method for the 163 significantly expressed genes. Each pair of plots, starting from 
the top left panel, display the same range on the vertical axis. Each colored line is the posterior 
mean estimate of the cluster-specific mean vector. Because more genes than those allocated 
to a cluster may have been used for inference of the mean vector, the colored curves (inferred 
mean vectors) are not necessarily located amid the profiles of the genes in that cluster (e.g., 
Cluster #10, which shows a rather extreme example). Genes with black lines are analyzed in 
more detail and presented in Fig. 4- I'n, particular, the three genes with black lines in Cluster 
#11 are also allocated to Cluster #10 or Cluster #5 with a similar posterior probability (see 
Fig. 4). 



Our DIRECT method identified 14 clusters for the 163 genes. Clusters differ 
in both the mean vectors (Fig. 2) and the three types of variability (Fig. 3). The 
cluster means differ in the magnitude and timing of the maximal or minimal 
expression. Because more genes than those allocated to a cluster may have been 
used for inference of the mean vector, the colored curves (inferred mean vectors) 
are not necessarily located amid the profiles of the genes in that cluster (e.g., 
Cluster #10, which shows a rather extreme example). In terms of variability, the 
inferred clusters are homogeneous visually and numerically: the within-cluster 
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Fig 3. Posterior median estimates of standard deviations from our DIRECT program for 
the three types of variability in each inferred mixture component for the 163 significantly 
expressed genes. Colors and numbering match those in Fig. 2. 



variability is small for most inferred clusters, whereas in all clusters the majority 
of the variability left unexplained by the mixture model is the residual variabil- 
ity, which is the variability between replicates (Fig. 3). Note that, in several 
clusters such as #9, #12, and #14, the estimated within-cluster variability in 
Fig. 3 may seem higher than the clustered mean profiles would indicate (Fig. 2). 
This is because our probabilistic clustering method estimated these variability 
terms using more genes than those assigned to the corresponding cluster based 
on the highest posterior allocation probability. These additional genes may in- 
crease the within-cluster variability. This impact is particularly noticeable for 
clusters #9 and #12 in Fig. 2: the estimated mean profiles of these two mixture 
components are generally lower than the mean profiles of individual genes in 
each cluster, suggesting that additional genes used in the inference have much 
lower (relative) expression levels than these few assigned to these clusters. 
Whereas the mean profile plot (Fig. 2) and the variability plot (Fig. 3) visu- 
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FlG 4. Replicated and mean temporal profiles, as well as posterior allocation probabilities of 
six genes from the 163 gene set. These genes correspond to the black lines in Fig. 2. For 
each gene, the top plot shows the replicated (colored) and mean (black) temporal profiles. 
Coloring here indicates replicates rather than clustering. The bottom plot shows the inferred 
posterior probabilities (vertical lines) of allocating the corresponding gene to a cluster (or 
mixture component). The lengths of the vertical lines sum up to 1 in each of these three plots. 
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Fig 5. PC A plot of the posterior allocation probability matrix for 163 genes. These colors 
match those in Figs. 2 and 3. Six arrows point to the six genes also highlighted in Fig. 2 and 
examined in Fig. 4- 
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alize different features of inferred clusters, they do not display the uncertainty 
and complexity in inferred clustering. For example, gene CG6018, inferred to 
belong to Cluster 7^3 (with peak expression appearing around 100 min; very 
late response) with probability 0.51, also has a substantial probability of 0.46 
to be associated with Cluster #7 (with peak expression appearing in between 
50 and 100 min; late response); see Fig. 4. Indeed, the replicated profiles of 
this gene show similarity to the cluster mean profiles of both clusters. Our in- 
ference indicates that the temporal profile of CG6018 is better described by a 
two-component mixture distribution, sharing features with both clusters. This 
is in contrast to genes Cecropin C (or CecC) and pebbled (or peb), whose profiles 
can be adequately represented by one multivariate normal component (Fig. 4). 
Three genes CGI 0080, GG120U and CG 17508 are better described by a three- 
component mixture distribution, i.e., their expression profiles share features with 
three clusters (Fig. 4). 

We apply principal components analysis (PCA) to the posterior allocation 
probability matrix to visualize the uncertainty and complexity in clustering. 
Fig. 5 shows the scores of the probability matrix based on the first two princi- 
pal components. Since each row of the probability matrix represents the distri- 
bution of cluster allocation for an individual gene, the PCA plot displays the 
positions of individual genes relative to their allocated clusters and to other 
clusters. Genes with similar posterior allocation probabilities are located next 
to each other. Specifically, most of the genes are allocated to a single cluster 
with probability above 0.8 and stay close to each other in the same cluster on 
the PCA plot. On the other hand, genes associated with multiple clusters each 
with a substantial probability are located in between those clusters. For exam- 
ple, the aforementioned gene CG6018 is positioned between Clusters #3 and 
#7 on this plot. 

To examine the sensitivity of our method to specification of the priors, we 
experimented with different options regarding the priors described in Section 3.3. 
Specifically, we considered values of 1 and 2 for the upper bound u in the uniform 
prior for the variability parameters As, considering that the overall standard 
deviation in the data is 0.5. We tried all the three options for generating the 
mean vectors. We computed summary statistics from the data to use as the 
parameters in the OU process and Brownian motion. For example, we used the 
sample mean and standard deviation of the data at min as the mean and 
standard deviation, respectively, of the normal distribution we assume for the 
starting values of the OU process or the Brownian motion. We also compared the 
Gibbs and the MH samplers for the concentration parameter a. These different 
choices turned out not to have much impact on the results. 

To examine the sensitivity of our method to changes in the data, we applied 
DIRECT also to the larger data set of 270 genes, identified at an FDR of 25% 
by EDGE. DIRECT identified 19 clusters for this larger data set (Figs. 1-3 in 
Supplemental Material). The cluster allocation is similar to that for the 163 
genes, with the additional 107 genes allocated to most of the clusters identified 
for the 163 genes (Figs. 1-3 in Supplemental Material). 



A. Q. Fu et al./Bayesian Clustering 



23 



5.3. Biological implications 

The inferred clustering suggests roughly three stages of exhibiting upregula- 
tion: before 50 min (early response), between 50 and 100 min (late response), 
and around and after 100 min (very late response). Clusters 9 and 12 showing 
early transcriptional responses contain most of the known target genes; that 
is, Notch has a direct impact on the transcriptional changes of these genes. 
Cluster 7 showing late responses also contains 3-5 known targets (Krejci et al., 
2009), but approximately 10 other genes in this cluster may also be Notch tar- 
gets. Genes in other late or very late response clusters may be Notch targets as 
well. Together with our collaborators, we analyze data from additional experi- 
ments to examine whether this is the case (Housden et al., 2012). Furthermore, 
the very late stage was not expected based on studies of a few known target 
genes and prior to our clustering analysis. The three stages are detected also 
among genes showing downregulation (Fig. 2). Considering that Notch generally 
promotes transcription rather than represses it, these observations suggest un- 
known, complex regulation mechanisms involving interactions between different 
clusters of genes. Hypotheses on possible transcriptional regulation mechanisms 
are investigated in Housden et al. (2012). 

5.4. Results from SplineCluster and MCLUST 

For comparison, we ran SplineCluster and MCLUST on the two real data sets, 
using the average profiles and the default settings (Table 5). SplineCluster in- 
ferred only 7 clusters for both data sets, with the inferred clusters exhibiting 
a much higher level of heterogeneity than under our DIRECT method (Figs. 
4-5 in Supplemental Material). This result is consistent with its performance 
on simulated data: SplineCluster also tends to infer a lower number of clusters 
in case of high heterogeneity (Section 4 and Table 4). MCLUST inferred 15 
clusters for 163 genes, which is comparable to our DIRECT method (Figs. 6 
and 8 in Supplemental Material). However, it inferred only 2 clusters for 270 
genes and a different covariance model (Figs. 7 and 9 in Supplemental Material). 
This sensitivity of clustering to the relatively minor change in the data may have 
arisen from MCLUST trying to simultaneously select the number of clusters and 
the covariance model. Selection of the covariance model adds another layer of 
complexity to the problem of clustering, particularly when different covariance 
models considered by MCLUST, but not defined by the experimental design, 
may be competing models for the data. This may also explain the particularly 
high variability in the inferred number of clusters for simulated data in Simula- 
tion #1 (Table 4). 

6. DISCUSSION 

In this paper, we developed DIRECT, a model-based Bayesian clustering method 
for noisy, short and replicated time-course data. We implemented this method 
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Table 5 

Numbers of clusters estimated by three clustering methods: DIRECT, SplineCluster and 
MCLUST for genes identified by EDGE (Storey et al., 2005) to be differentially expressed 

over the time course. 





No. of Inferred Clusters 




163 Genes 


270 Genes 




(FDR 10%) 


(FDR 25%) 


DIRECT 


14 


19 


SplineCluster 


7 


7 


MCLUST 


15 


2 



in the R package DIRECT, which may be downloaded from CRAN (http: 
//cran. r-project.org/web/packages/). We also applied this method to ana- 
lyze the time-course microarray gene expression levels following Notch activation 
in Drosophlia adult muscle cells. Our analysis identified 14 clusters in 163 dif- 
ferentially expressed genes and assigned probabilities of cluster membership for 
each gene. The clustering results indicate three time periods during which genes 
attain peak up- or down-regulation, which was previously unknown, and sug- 
gest possibilities for the underlying mechanisms of transcription regulation that 
may involve interactions between different clusters. Hypotheses on the biological 
mechanisms are further investigated in Housden et al. (2012). Here we discuss 
several additional aspects of the clustering method. 

Our method has four main features. First, the random-effects mixture model 
decomposes the total variability in the data into three types of variability that 
arise from clustering (A^), from sampling across multiple experimental condi- 
tions (At), and from sampling a limited number of replicates (Ae). This variance 
decomposition regularizes the covariance matrix with constraints that are con- 
sistent with the experimental design. It is simultaneously parsimonious and iden- 
tifiable for the replicated data: the replicated profiles at multiple time points 
of a single gene are already informative for A^ and A^, and having at least 2 
genes in a cluster makes A^ estimable. Second, our method uses the Dirichlet- 
process prior to induce a prior distribution on clustering as well as the number of 
clusters, making it possible to estimate directly both unknowns from the data. 
Third, we have developed a novel Metropolis-Hastings MCMC algorithm for 
sampling under the Dirichlet-process prior. Our MH algorithm allows the use 
of nonconjugate priors. It is also efficient and accurate, as simulation studies 
demonstrate. Fourth, our method infers the posterior allocation probability ma- 
trix through resampling and relabeling of the MCMC samples. This probability 
matrix can then be used directly in forming clusters and making probabilistic 
cluster allocations. Simulation studies and application to real data show that 
DIRECT is sensitive enough to variability in the data to identify homogeneous 
clusters, but not too sensitive to minor changes in the data. 

Several other model-based clustering methods construct their models along 
similar lines (Celeux, Martin and Lavergnc, 2005; Ma et al., 2006; Zhou and 
Wakefield, 2006; Booth, Casella and Hobert, 2008). In fact, our model in Eq. (2.1) 
coincides with the random-effects model E3 in Celeux, Martin and Lavergne 
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(2005). However, those authors decided to focus on a sHghtly shiipler model, 
which is similar to Eq. (2.1) but without the within-component random effects 
(j>^. They based their decision on the nearly identical likelihoods of the two 
models for simulated data. Ma et al. (2006) and Zhou and Wakefield (2006) 
did not deal with replicated data and included in their model only two types 
of variability: the within-cluster variability and the variability due to multiple 
time points. Similar to us, Booth, Casella and Hobcrt (2008) worked with repli- 
cated time-course data and used random effects to account for different types of 
noise, but their partition of the total variability is not based on the experimen- 
tal design and is therefore much less straightforward. Specifically, they allowed 
for dependence among different items in the same cluster but did not explicitly 
account for the random effect due to time (or experimental condition). 

Note that our DIRECT method does not account for the temporal struc- 
ture, but rather focuses on modeling the covariance matrix. This is similar to 
the approach MCLUST takes, which applies eigenvalue decomposition to the co- 
variance matrix and considers various constraints on the decomposed covariance 
matrix (i.e., whether the shape, orientation or volume of the covariance matrix 
is identical across mixture components), although the constraints considered in 
MCLUST are not based on any experimental design. The good performance of 
our method on both simulated and real data, and of MCLUST in several cases, 
suggests that accounting for the temporal structure in the mean vectors, such 
as via splines functions as in SplineCluster or via Gaussian processes as in Zhou 
and Wakefield (2006) and others, may not be necessary. We also followed the 
approach in Zhou and Wakefield (2006) and modeled the mean vector of each 
mixture component as a Brownian motion (with drift) and, extending this idea, 
as an Ornstein-Uhlenbeck process. The clustering results such as the inferred 
number of clusters and allocation of individual genes did not change much. This 
is because these approaches impose the temporal structure on the mean vec- 
tor. Conditioning on the correct clustering, the data are directly informative of 
the cluster mean, a main parameter of interest. Incidentally, DIRECT is ap- 
plicable also in more general cases of multiple experimental conditions, where 
dependence among conditions is nonexistent, unclear or unknown. 

Similar to other MCMC methods, our DIRECT method does not aim to 
optimize the runtime. Whereas MCLUST and SplineCluster, both non-MCMC 
methods, took only seconds or at most minutes to run on the simulated and 
real data here, we ran DIRECT for hours to make sure that the results were 
consistent across different runs, which indicated that the Markov chain had 
mixed well. 

We have used only the one-parameter Dirichlet-process prior in our method. 
The concentration parameter in the Dirichlet-process prior simultaneously con- 
trols the number of clusters as well as the size of each individual cluster. The 
prior has the tendency of creating clusters of very different sizes. The posterior 
inference to generate the posterior allocation probability matrix is therefore crit- 
ical to balance out the unevenness: although certain clusters may be very small 
or very big in a single iteration, items allocated these tiny clusters are likely 
allocated to other, possibly larger, clusters over a sufficient number of MCMC 
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iterations. Nonetheless, as pointed by the Associate Editor and an anonymous 
reviewer, other exchangeable priors, such as the two-parameter Dirichlet pro- 
cess (a.k.a, the Pitman- Yor process; Pitman and Yor 1997) and many other 
extensions of the Dirichlet process reviewed in Hjort et al. (2010), may also be 
adopted under our framework. Indeed these other exchangeable priors may offer 
more flexibility and offer an important direction to extend our current work. 

Under our and Ncal (2000) 's MH MCMC algorithms, the Markov chain is 
constructed for the cluster memberships of individual items. Generation of a new 
cluster and elimination of an existing cluster are implied rather than enforced. 
In contrast, reversible-jump MCMC (Richardson and Green, 1997) and birth- 
death MCMC (Stephens, 2000a) enforce changes in dimensions by designing 
the MCMC moves around the number of clusters. Their strategy may not be 
efficient for clustering multivariate data, because even a fixed number of clusters 
may correspond to a large number of possible partitions and a large space of the 
cluster-specific parameter values. For clustering it seems more sensible for the 
Markov chain to move to the neighborhood of the "correct" number of clusters 
and to fully explore the parameter space in this neighborhood, as under Neal's 
approaches and under our method. 

Appendix A: Proof of Proposition 1 

Proof. We use the joint distribution of clustering and the number of clusters 
given in Eq. (3.4) for derivation. Let K-i be the number of clusters when the 
i-th gene is excluded. Then, 



Pr(Zi ^ z,K ^ k\Z_i = z_i, a) 



Pr(Z = z,K = k\a) 
Pr(Z_j = z_i, = fc„j|a) 

r{a)/r{a + N)a'^l\li{Ni-iy. 

T{a)/T{a + N- l)a^-^ U':=iiNs ~ 1)! 

{ j^j^^^ , Zi is not in a singleton cluster 
, Zi is in a singleton cluster 

□ 
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Supplementary material for the paper 
Bayesian Clustering of Replicated Time-Course Gene Expression Data 

With Weak Signals 
by Audrey Q. Fu, Steven Russell, Sarah J. Bray and Simon Tavare 

1. Proof of Proposition 2 in main text. For cluster membership Zi 
with current value z' and current number of clusters K = k' , we generate 
proposed value z* , which implies the proposed number of clusters k* , from 
the discrete uniform distribution over the integer set {!,..., z' — l,z' + 

1, . . . , A;' + 1}. Recall that the Hastings ratio is defined as follows: 
(N 

+J ^ ^ T^jZj = z*) g{Zi = z'\Zi = z*) 

Q tt{Z, = z') g{Zi = z*\Zi = z') 

^ ^ ^ Pr(M^|Z^ = z*,-) ^ PT{zi,...,z*,...,ZN,k*\a) ^ g{z'\z*) 
^ Y>T{Mi\Z, = z',-) VT{zi,...,z',...,ZN,k'\a) g{z*\z'y 

I 1 (1.2) = ffieft X -^middle X -f^right, 

where Mi is the data of the z-th item, • refers to the parameters of interest 
other than Zi, and g is the distribution to generate proposals. 

We consider four cases below when computing H. The likelihood ratio, 

1 ^1 the leftmost ratio in Eq. (1.1), is computed in the same way in all four cases, 

except when z* is again a new label, the component-specific parameter values 
are generated from their respective priors. Hence, we derive the middle ratio 
^\ below and the rightmost ratio subsequently. Derivation of -ffmiddie relies on 

the joint distribution of Z and K, which is Eq. (3.4) in the main text and 

in 



copied here: 



O Ff ) 

(1.3) Pr(Zi, ...,Zm, K\a > 0) = Y[{Ni - 1)!, 

^ where Ni is the size of the l-th cluster. The case where a = as in Eq. (3.5) 

^ in the main text is not considered in the derivation here, because a is also 

H sampled during the MCMC procedure and has probability of taking value 

^ 0. 

1. Item i is currently in a singleton cluster and proposal z* is the same 
as another existing label, then the singleton cluster is eliminated after 
the move. Hence, 

_a^''-\N,* + l-l)\ _N,* 

-f^ middle ~ 



a'^'iN,* - 1)! a 



2 



2. Item i is currently in a singleton cluster and proposal z* is again 
different from any other existing label, then the number of mixture 
components does not change. Hence, 

a 

-f^middle — TJ — !• 

a'' 

3. Item i is currently in a non-singleton cluster and proposal z* is the 
same as an existing label, then the number of mixture components 
does not change. 

_ a'^' {N,> - 1 - iy.{N,, + 1 - l)\ _ N,, 

-"middle — 



a'='(iV,, -l)!(iV,. -1)! iV.'-r 

4. Item i is currently in a non-singleton cluster and proposal z* is different 
from any existing label, then a new cluster of size 1 is generated: 

_ c^'+i(iV^/ - 1 - 1)!(1 - 1)! _ a 

middle 



a^'{N^,-l)\ N^,-l' 

Since the number of mixture components sometimes changes after the 
proposed move, the rightmost ratio in Eq. (1.1) is calculated as follows: 



^ g{z'\z*) ^ l/k* ^ k' 
''^^^ g{z*\z') l/k' k* 



k'/{k' - 1), Case 1 

1, Case 2 

1, Case 3 

k'/{k' + 1), Case 4 



where k' and k* are the number of mixture components before and after the 
proposed move, respectively. Putting together the three ratios gives Table 1 
in the main text. 

2. Details of the Markov Chain Monte Carlo (MCMC) algo- 
rithm. Recall that for the z-th item the data vector of repeated measure- 
ments at J time points from R replicates is 

Mi = (Mill, . . . , Man, Min, . . . , Munf , 

and the parameter vector of interest is 

^ = {K, ©1, . . . , ©A', A^^, . . . , A(^^, , . . . , At-''^, Ae"^, . . . , Xj^ , Zi, . . . , Zn, a}. 

The main text describes how cluster memberships Z and the number of 
clusters (mixture components) K are updated. Here, we describe how we 
update other elements in ^. 



3 



2.1. Updating cluster- specific parameters. To update cluster-specific pa- 
rameters, we reformulate the random-effects model for Mi as a hierarchical 
model, which includes two layers: 

• The first layer consists of independent and identically distributed repli- 
cated measurements of the i-th item at the j-th time point: 

Mijr\{Zi = k, Hij, A^} ~iid N(/Xij, A^), 

where jiij is the mean for the z-th item at the j-th time point. 

• The second layer consists of the mean vectors of all the items allocated 
to the A;-th cluster: 

= k, ©^ A^, A^ A^} ~iid NJ(©^ A^S + A^I), 

where I and S, each of dimension J x J, have 1 only on the diagonal 
and in all entries, respectively. 

We update latent variables /itj and parameters 0^, A^, A^, and A^ as 
follows: 

1. Update the mean vector of length J for each item currently allocated 
to this cluster, fi^, i = l,...,Nk, which has a multivariate normal 
distribution. 

7^(/x,|•) = 7^()udM,,0^A^,A^,A^) 

R 



cx n PiMiir, . . . , Murll^i, A,^I)p(/i,|0^ A^S + A^:I) 

r=l 

where the posterior covariance matrix and mean are, respectively, 
Sm = {i?(A,^)-^I + (A^S + A^I)-i}-\ 



Mm = Sm{(A^S + A^I)-i0*^ + i?(A^)-iy^}, 
1 ^ 



r=l 



2. Update the mean profile for the cluster, 0^^, using the average of the 
individual mean vectors. 

3. Update the three types of variability using all the data in the cluster 
with a Metropolis-Hastings sampler. Take the within-cluster variability 
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for example. Its posterior distribution is as follows: 

7r(A^|-) oc ^(A^) J]p(M,|A^, A^, A^, fi^, . . . , fi^J, 

i=l 

where vr(A^) is the prior distribution for A^, k = 1, . . . ,K. Given cur- 
rent value A^, the MH sampler generates proposal A^* from N(A0, cj^), 
where o"^ is the pre-determined variance in the normal proposal distri- 
bution, and accepts the proposal with probability min{l,p(A0*|-)/p(A(^|-)}. 
The MH sampler allows us to experiment with different prior distri- 
butions on the variabilities. 

2.2. Updating concentrate parameter a. Concentration parameter a con- 
trols the number of clusters identified and is therefore one of the key param- 
eters. Either a Gibbs or an MH sampler can be used to update a. 

• Assigning a a Gamma prior distribution with shape parameter a and 
rate parameter b, Escobar and West (1995) proposed a Gibbs sampler 
that introduces a latent variable t] to help construct the posterior dis- 
tribution. Given current value a' and current number of clusters k' , 
the distributions for r] and a are as follows: 

r]\a' ~ Beta(Q;' + l,iV), 
a\T], k' ~ ti;^Gamma(a + k' ,b — log jy) + (1 — i(;^)Gamma(a + k' — l,b — logry), 

where 

a + k' — 1 
1 — Wn N{b — logrj) ' 

• In an MH sampler, the target distribution is 

7r(a|-) oc Fr{Z, K\a)-K{a), 

where Pr(Z,i^|a) is defined in Eq. (1.3) and 7r(a) is the prior distri- 
bution of a. With current values a' and k' , and proposed value a*, the 
Hastings ratio for this target distribution is then 

r(a*) r{a' + N) /a*\^'7:{a*) g{a*\a') 
~ r(a') r(a* + N) w) TT{a') g{a'\a*)' 

where vr and g are the prior and proposal distribution, respectively. 
We accept proposal a* with probability mm{l, H). 
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3. Stephens relabeling algorithm for finite mixture model. In 

the two-step posterior inference, after resampling for each of the H MCMC 
samples to obtain posterior allocation probability matrices Q^'^) , /i = 1, • • • ,H, 
with arbitrary labeling, we apply Algorithm 2 from Stephens (2000) to 
these matrices. Step 2 in this algorithm, as pointed out in Appendix A 
in Stephens (2000), is equivalent to the assignment problem, which can be 
quickly solved using efficient algorithms, such as the Hungarian algorithm 
(Kuhn, 1955; Munkres, 1957). Thus the complete relabelling algorithm is as 
follows: 

Input: H N xK double matrices (each a posterior classification probability 
matrix) 

Output: 1 H X K integer matrix (each row a permutation of K integers) 
while fixed point not reached do 

Step 1: compute pik = jj J2h=ip['k 
Step 2: 

for /i = 1 to -fT do 

{find permutation for each h to minimize KL distance between ma- 
trices P and Q} 

compute cost matrix with the (j, l)th entry being 
CU, I) = Ell quiMO^'^} log ""^'^'"'^ 

where Vh represents the current permutation in the hth sample 

apply the Hungarian algorithm to find the optimal permutation v'f^. 
end for 
end w^hile 

4. Preprocessing of the time-course microarray gene expression 
data. 

4.1. Quality control. Recall that we measured expression levels at 18 
time points for four biological replicates in the time-course experiment. For 
each gene, a "bad" time point means that the measurements of 2 or more 
replicates are not acceptable. We discard a gene if it has 6 or more "bad" 
time points among the first 16 time points (i.e., excluding 120 and 150 min). 
This screening results in 7,467 genes, about half of the whole set of 14,569 
genes printed on the microarray. 

4.2. Imputation. Retained genes may still have missing values due to 
poor data quality at some time points or from some replicates. We imputed 
missing values before clustering, following the rules below: 
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1. Consider gene i at time t. In general, when only one out of four values 
is missing, use the median of the other three measurements at time t to 
impute the missing value. Otherwise, compute the population median 
at t. Combine the population median with the available measurements 
of the gene i and use the median of these values to impute the missing 
values for gene i at t. 

2. For a gene at the 1st and last time points, apply Rule ^1. 

3. For a gene at any other time point t, 

(a) Impute at time points where there is only one value missing. 
Apply Rule #1. 

(b) Impute at time points where two or three values are missing. Use 
the median of values at t, t — 1 and t + 1. If there is no or only 
one value at t — 1 or t + 1, use their neighboring time points, such 

as t — 2 or t + 2 and so on. 

(c) Impute at time points where all four values are missing. Use the 
median of values ait—l and t + 1. Similar to the previous step, if 
there is no or only one value aX t—1 or t+1, use their neighboring 
time points. 

5. Additional figures. 
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Cluster 11 (6 genes) Cluster 14 (2 genes) Cluster 13 (3 genes) Cluster 19 (3 genes) 




50 100 150 50 100 150 ' 50 100 150 ' 50 100 150 
Cluster 4 (8 genes) Cluster 18 (3 genes) Cluster 15 (16 genes) Cluster 10 (14 genes) 




Fig 1. Mean profiles of individual genes in clusters identified by our DIRECT for the 
270 significantly expressed genes. The non-black lines are the posterior means of the mean 
vector of each inferred component under this model, their colors matching those in Fig. 3. 
Because often more genes than those allocated to a cluster have similar features and are 
pooled together for inference of the mean vector, the non-black curves (inferred mean vec- 
tors) are not necessarily located amid the profiles of the genes in that cluster. 
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Fig 2. Postenor median estimates of standard deviations from DIRECT program for the 
three types of variability in each inferred mixture component for the 270 significantly ex- 
pressed genes. 
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PC1 (variability: 20%) 



Fig 3. The PC A plot of the posterior allocation probability matrix inferred by DIRECT 
for the 270 significantly expressed genes. The colors of the clusters match those m Fig. 1. 




Fig 4. Mean profiles of individual genes in clusters identified by SplineCluster (Heard, 
Holmes and Stephens, 2006) for the 163 gene set. 




Fig 5. Mean profiles of individual genes in clusters identified by SplineCluster (Heard, 
Holmes and Stephens, 2006) for the 270 gene set. 
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number of components 

Fig 6. Penalized log-likehhood curves under the Bayesian Information Criterion (BIC) 
from MCLUST (Fraley and Raftery, 2002) for the 163 gene set. Each combination of 
letters represents a model of the covariance matrix. 
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number of components 



Fig 7. Penalized log-likehhood curves under the Bayesian Information Criterion (BIC) 
from MCLUST (Fraley and Raftery, 2002) for the 270 gene set. Each combination of 
letters represents a model of the covariance matrix. 




Fig 8. Mean profiles of individual genes in clusters identified by MCLUST (Fraley and 
Raftery, 2002) for the 163 gene set. Also see Fig. 6. 




Fig 9. Mean profiles of individual genes in clusters identified by MCLUST (Fraley and 
Raftery, 2002) for the 270 gene set. Also see Fig. 7. 
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